Abstract

This paper applies the recently axiomatized Optimum Information
Principle (minimize the Kullback-Leibler information subject to all
relevant information) to nonparametric density estimation, which provides a
theoretical foundation as well as a computational algorithm for maximum
entropy density estimation. The estimator, called the optimum information
estimator, approximates the true density arbitrarily well. As a by-product
I obtain a measure of goodness of fit of parametric models (both
conditional and unconditional) and an absolute criterion for model selection, as
opposed to other conventional methods such as AIC and BIC, which are
relative measures.

1 Introduction 

This paper applies the Optimum Information Principle, recently axiomatized by
Toda (2011) by refining Jaynes's axioms of plausible reasoning (Jaynes, 2003,
Chapters 1 and 2), to nonparametric density estimation. The optimum
information principle, which is more fundamental than Bayesian inference (Bayes,
1763), Fisher's maximum likelihood principle (Fisher, 1912), Jaynes's
maximum entropy principle (Jaynes, 1957), and Kullback's principle of minimum
discrimination information (Kullback, 1959, p. 37), prescribes minimizing the
information gain (measured by the Kullback-Leibler information (Kullback and
Leibler, 1951)) of updating from a prior to a posterior subject to all relevant
information. In particular, for nonparametric density estimation, it means
maximizing the Shannon entropy (Shannon, 1948) of a density subject to sample
moment constraints. Such an information-theoretic approach to density
estimation has been known (see Wu (2003) and the references therein) but has not
yet gained popularity in econometrics, possibly due to the lack of a theoretical
foundation as well as a simple algorithm of computation. This paper provides
both. As a by-product I obtain a measure of goodness of fit of parametric
models and a criterion for model selection, which is closely related to BIC (Schwarz,
1978) but distinct from other conventional methods such as AIC (Akaike, 1974)
and BIC in that it is an absolute criterion, not a relative one. AIC and BIC can only
pick the best model among the competing ones, but it may be the case that all
models are poor. Our measure reveals the poor fit when all models are indeed poor.



2 Optimum Information Estimator 

Suppose that $\{X_n\}_{n=1}^N$ are i.i.d. random variables taking values in $\mathbb{R}^K$, with an
unknown density $f(x)$. Let $\{x_n\}_{n=1}^N$ be the realization of $\{X_n\}$. Our task is
to obtain a reasonable estimate $\hat f$ of $f$ from the data. Since the data $\{x_n\}$ is a
finite set, it is compact and discrete. Therefore there is no reason to believe that
the true distribution $f$ has an unbounded support or that $f$ is discontinuous.
Hence throughout the paper let us assume that $f$ is continuous and compactly
supported on $S \subset B(R)$, where $B(R)$ denotes the closed ball with radius $R >
\max_n \|x_n\|$ with center at the origin.

Sample moments. Sample moments of $f$ are computed by $\hat m_i = \frac{1}{N}\sum_{n=1}^N x_{ni}$,
$\hat m_{ij} = \frac{1}{N}\sum_{n=1}^N x_{ni}x_{nj}$, etc. In general let us introduce the multi-index (Folland,
1999, p. 236) of nonnegative integers $\alpha = (\alpha_1, \dots, \alpha_K)$ and let the $\alpha$-th sample
moment be denoted by

$$\hat m_\alpha = \frac{1}{N}\sum_{n=1}^N x_n^\alpha = \frac{1}{N}\sum_{n=1}^N x_{n1}^{\alpha_1}\cdots x_{nK}^{\alpha_K}.$$

We set $|\alpha| = \sum_{k=1}^K |\alpha_k|$, $\alpha! = \alpha_1!\cdots\alpha_K!$, etc. Even more generally, if $T :
S \to \mathbb{R}^L$ is a Lebesgue measurable function, the sample moments of $T$ can be
defined by $\bar T := \frac{1}{N}\sum_{n=1}^N T(x_n)$. The function $T$ represents the moments that
the econometrician considers relevant for inference. If there is no particular
reason to choose otherwise, it is natural to set $T(x) = (x^\alpha)_{|\alpha| \le A}$, where $A$ is
typically an even integer from 2 to 10.
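As an illustration of the multi-index notation, the following minimal sketch computes the sample moments $\frac{1}{N}\sum_n x_n^\alpha$ for all multi-indices up to order $A$. The function names `multi_indices` and `sample_moments` are hypothetical, and the sketch assumes that the zeroth-order index $\alpha = 0$ is left out because the normalization constraint $\int g = 1$ already covers it.

```python
from itertools import product
from math import prod

def multi_indices(K, A):
    """All multi-indices alpha in N^K with 1 <= |alpha| <= A.

    Assumption: alpha = 0 is excluded because the constant moment is
    handled by the normalization constraint.
    """
    return [a for a in product(range(A + 1), repeat=K) if 1 <= sum(a) <= A]

def sample_moments(data, A):
    """Return (indices, moments) with moments[j] = (1/N) * sum_n x_n^alpha_j.

    `data` is a list of K-dimensional points given as tuples of floats.
    """
    N = len(data)
    K = len(data[0])
    indices = multi_indices(K, A)
    moments = [
        sum(prod(x[k] ** a[k] for k in range(K)) for x in data) / N
        for a in indices
    ]
    return indices, moments

if __name__ == "__main__":
    idx, mom = sample_moments([(1.0, 2.0), (3.0, 4.0)], A=2)
    for a, m in zip(idx, mom):
        print(a, m)
```

For $K = 2$ and $A = 2$ this yields five moments, one for each of the monomials $x_2$, $x_2^2$, $x_1$, $x_1x_2$, and $x_1^2$.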



Optimum information estimator defined. The optimum information
estimator (OIE) of $f$, denoted by $\hat f$, is defined by

$$\hat f = \operatorname*{arg\,min}_{g \in L^1(S)} \int_S g(x)\log g(x)\,dx
\quad\text{subject to}\quad \int_S T(x)g(x)\,dx = \bar T,\ \int_S g(x)\,dx = 1, \tag{2.1}$$

where $dx$ denotes the Lebesgue measure and $\int g\log g$ is the Kullback-Leibler
information of the density $g$ with respect to the uniform prior on $S$.^2 Hereafter
all integrations are carried out on the compact set $S$ and therefore we drop the
subscript $S$ from the integral sign. Minimizing the information as in (2.1) is
optimal from an information-theoretic as well as a Bayesian point of view. See
Toda (2011) for a theoretical justification of this definition.

^2 By convention we set $0\log 0 = 0$ and $x\log x = \infty$ if $x < 0$.
Computing the optimum information estimator. The minimization
problem (2.1) is a special case of an entropy-like minimization problem, which can
be solved by Fenchel duality.^3 Since $\int g = 1$, (2.1) is equivalent to

$$\min_{g \in L^1(S)} \int \bigl[g(x)\log g(x) - g(x)\bigr]\,dx
\quad\text{subject to}\quad \int T(x)g(x)\,dx = \bar T,\ \int g(x)\,dx = 1. \tag{P}$$

By Corollary 2.6 and Example 5.6(ii) of Borwein and Lewis (1991), the dual
problem of (P) is given by

$$\max_{z \in \mathbb{R},\,\lambda \in \mathbb{R}^L} \left[ z + \lambda'\bar T - \int e^{z + \lambda'T(x)}\,dx \right]. \tag{D}$$

I assume that a regularity condition of the Fenchel duality theorem holds and
therefore the dual problem (D) has a solution and the primal and dual values
coincide. One such regularity condition is that $\bar T$ belongs to the interior of
$T(S) = \{T(x) \mid x \in S\}$, which is very weak.^4 Since the objective function in
(D) contains an integral over the compact set $S$, it is always finite and differentiable
in $(z, \lambda)$ (Toda, 2010, Proposition B.5). Since $(z, \lambda)$ are unconstrained, the
maximum is obtained by differentiating and setting equal to zero. Partially
differentiating (D) with respect to $z$, we obtain

$$1 - \int e^{z + \lambda'T(x)}\,dx = 0 \iff z = -\log\left(\int e^{\lambda'T(x)}\,dx\right).$$



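Substituting this into (D), and using the Fenchel duality theorem (Borwein and
Lewis, 1991, Corollary 2.6), after some algebra we obtain:

Theorem 1 (Fenchel Duality).

$$H_{\min} := \text{Minimized Kullback-Leibler information}
= \min_{g \in L^1(S)} \left\{ \int g\log g \;\middle|\; \int Tg = \bar T,\ \int g = 1 \right\}
= \max_{\lambda \in \mathbb{R}^L} \left[ \lambda'\bar T - \log\left(\int e^{\lambda'T(x)}\,dx\right) \right]. \tag{2.2}$$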
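The algebra behind Theorem 1 is short and may be worth spelling out. The sketch below assumes that the primal objective in (P) is $\int (g\log g - g)$, which differs from the objective in (2.1) only by the constant $-1$ on the feasible set; this reading is consistent with the exponential term in (D) but is an inference rather than a quotation.

```latex
\begin{align*}
\max_{z,\lambda}\Bigl[z + \lambda'\bar T - \int e^{z+\lambda'T(x)}\,dx\Bigr]
  &= \max_{\lambda}\Bigl[\lambda'\bar T - \log\!\int e^{\lambda'T(x)}\,dx - 1\Bigr]
  && \text{(substitute } z = -\log\textstyle\int e^{\lambda'T(x)}dx\text{)},\\
\min_{g}\int \bigl(g\log g - g\bigr)\,dx
  &= \min_{g}\int g\log g\,dx - 1
  && \text{(since } \textstyle\int g = 1\text{ on the feasible set)}.
\end{align*}
```

Equating the primal and dual values and cancelling the common constant $-1$ gives $\min_g \int g\log g = \max_\lambda [\lambda'\bar T - \log\int e^{\lambda'T(x)}dx]$.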



The duality Theorem 1 was first exploited by Gibbs (1902), although its
full understanding had to wait for the development of modern convex analysis
half a century later. The reduced dual problem (2.2) has a unique solution $\hat\lambda$ if
$\dim T(S) = L$ (i.e., $T(S)$ is not contained in a lower dimensional affine space)
because in that case the function $\phi(\lambda) = \log\left(\int e^{\lambda'T(x)}\,dx\right)$ is strictly convex
(Toda, 2010, Proposition B.4). In most applications the maximization problem
(2.2) has no closed-form solution. However, since the objective function is
differentiable and concave (Toda, 2010, Proposition B.5), a numerical solution
$\hat\lambda$ can be easily obtained by the Newton-Raphson algorithm (Luenberger, 1969,
Chapter 10).

^3 For a good account of Fenchel duality, see Rockafellar (1970), Borwein and Lewis (2006)
(finite-dimensional), and Luenberger (1969) (infinite-dimensional).

^4 See Boţ and Wanka (2006) for a nice discussion on regularity conditions.

Differentiating the objective function in (2.2) with respect to $\lambda$ and setting
equal to zero, $\hat\lambda$ satisfies $\bar T = \mathrm{E}_{\hat f}[T(X)] = \int T(x)\hat f(x)\,dx$, where

$$\hat f(x) = \frac{e^{\hat\lambda'T(x)}}{\int e^{\hat\lambda'T(x)}\,dx}. \tag{2.3}$$

The function $\hat f$ is the optimum information estimator. To verify this, substitute
$\hat f$ defined by (2.3) for $g$ in (2.2) and we can see that the equality is satisfied.
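The Newton-Raphson iteration for (2.2) can be sketched as follows. This is a minimal illustration rather than a reference implementation, and it rests on several assumptions not taken from the text: the data are one-dimensional ($K = 1$) with $T(x) = (x, x^2, \dots, x^A)$, the integral over $S = [-R, R]$ is approximated by the trapezoid rule, and a step-halving safeguard is added to the plain Newton step. The names `solve_linear` and `fit_oie_1d` are hypothetical. The gradient of the dual objective is $\bar T - \mathrm{E}_\lambda[T]$ and its Hessian is minus the covariance matrix of $T$ under $f_\lambda$.

```python
import math

def solve_linear(mat, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(mat)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            factor = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= factor * aug[c][k]
    sol = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][k] * sol[k] for k in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol

def fit_oie_1d(data, A=2, R=None, grid=2001, tol=1e-9, max_iter=50):
    """Maximize lam'Tbar - log int exp(lam'T(x)) dx over lam, with S = [-R, R]."""
    N = len(data)
    if R is None:
        R = 1.5 * max(abs(x) for x in data)   # assumption: any R > max |x_n|
    feats = lambda x: [x ** a for a in range(1, A + 1)]
    Tbar = [sum(col) / N for col in zip(*(feats(x) for x in data))]
    h = 2.0 * R / (grid - 1)
    xs = [-R + i * h for i in range(grid)]
    ws = [h / 2 if i in (0, grid - 1) else h for i in range(grid)]  # trapezoid weights
    Ts = [feats(x) for x in xs]

    def evaluate(lam):
        expo = [sum(l * t for l, t in zip(lam, Tx)) for Tx in Ts]
        top = max(expo)                        # subtract the max for numerical stability
        mass = [w * math.exp(e - top) for w, e in zip(ws, expo)]
        Z = sum(mass)
        logZ = top + math.log(Z)
        p = [m / Z for m in mass]              # discretized density f_lam on the grid
        mean = [sum(pi * Tx[j] for pi, Tx in zip(p, Ts)) for j in range(A)]
        cov = [[sum(pi * (Tx[j] - mean[j]) * (Tx[k] - mean[k]) for pi, Tx in zip(p, Ts))
                for k in range(A)] for j in range(A)]
        value = sum(l * t for l, t in zip(lam, Tbar)) - logZ
        return value, logZ, mean, cov

    lam = [0.0] * A
    value, logZ, mean, cov = evaluate(lam)
    for _ in range(max_iter):
        grad = [tb - mu for tb, mu in zip(Tbar, mean)]
        if max(abs(g) for g in grad) < tol:
            break
        step = solve_linear(cov, grad)         # Newton direction: cov^{-1} * gradient
        t = 1.0
        while True:                            # halve the step until the dual does not decrease
            trial = [l + t * s for l, s in zip(lam, step)]
            out = evaluate(trial)
            if out[0] >= value - 1e-12 or t < 1e-8:
                break
            t /= 2.0
        lam = trial
        value, logZ, mean, cov = out

    def density(x):
        """Fitted density (2.3) on S, zero outside."""
        if abs(x) > R:
            return 0.0
        return math.exp(sum(l * t for l, t in zip(lam, feats(x))) - logZ)

    return lam, value, density
```

A call such as `fit_oie_1d([-1.0, -0.5, 0.0, 0.5, 1.0], A=2, R=1.5)` returns the multiplier vector, the maximized dual value, and a callable density whose first two moments match the sample moments up to quadrature error.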

The functional form in (2.3) shows that the optimum information estimator
$\hat f$ belongs to an exponential family, and one might guess that the Lagrange
multiplier $\hat\lambda$ is the maximum likelihood estimator of that family. This conjecture
is indeed true.

Proposition 2. $\hat\lambda$ is a maximum likelihood estimator^6 for the exponential family
$\{f(x;\lambda)\}_{\lambda \in \mathbb{R}^L}$, where $f(x;\lambda) \propto e^{\lambda'T(x)}$.

Proof. The log likelihood of the model $f_\lambda(x) := e^{\lambda'T(x)} / \int e^{\lambda'T(x)}\,dx$ is

$$\log\mathcal{L}(\lambda) = \sum_{n=1}^N \log f_\lambda(x_n)
= \sum_{n=1}^N \left[ \lambda'T(x_n) - \log\left(\int e^{\lambda'T(x)}\,dx\right) \right]
= N\left[ \lambda'\bar T - \log\left(\int e^{\lambda'T(x)}\,dx\right) \right],$$

which is precisely the expression in (2.2) multiplied by the sample size $N$. □

We should not, however, misinterpret this result as saying that the optimum information
principle is a special case of maximum likelihood, for two reasons.
First, I showed (Toda, 2011, Theorem 5) that maximum likelihood is
(approximately) implied by the optimum information principle: we should therefore
interpret the above result as saying that a particular optimum information estimator
may coincide with the density corresponding to a particular maximum likelihood
estimator. Second, and more importantly, maximum likelihood is valid only if
the model contains the truth (i.e., $f_\lambda = f$ for some $\lambda$), but a model is never true;
as Box (1976) puts it, "Since all models are wrong the scientist cannot obtain a
"correct" one by excessive elaboration". Hence the maximum likelihood here is
actually the quasi-maximum likelihood (Huber, 1967). On the other hand, from
a Bayesian point of view the optimum information principle makes no reference
to the truth.

Let us summarize the above results in a theorem and an algorithm to
compute the optimum information estimator:

Theorem 3. Let $X_n \sim f$, i.i.d. with $f$ compactly supported on $S \subset \mathbb{R}^K$, $\{x_n\}_{n=1}^N$
their realizations, $T : S \to \mathbb{R}^L$ be Lebesgue measurable,
$\bar T := \frac{1}{N}\sum_{n=1}^N T(x_n)$, $\bar T \in \operatorname{int} T(S)$, and $\dim T(S) = L$. Then

1. there exists a unique optimum information estimator $\hat f$ defined by (2.1),

2. $\hat f(x) \propto e^{\hat\lambda'T(x)}$, where $\hat\lambda$ is the maximum likelihood estimator of the
exponential family $f(x;\lambda) \propto e^{\lambda'T(x)}$,

3. $\hat\lambda$ can be computed by the Newton-Raphson algorithm.

^6 More precisely, it is a quasi-maximum likelihood (QML, see Huber (1967)) estimator
because the model is misspecified.

Algorithm (Computation of the optimum information estimator).

Step 1. Choose the relevant support $S$ and moments $T : S \to \mathbb{R}^L$ to exploit.
Choose $S = B(R)$ with $R > \max_n \|x_n\|$ and $T(x) = (x^\alpha)_{|\alpha| \le \bar A}$, where
$\bar A$ is typically 10, unless there is a strong reason to do otherwise.

Step 2. For each $A = 2, 4, \dots, \bar A$, compute the optimum information estimator
by the Newton-Raphson algorithm (Theorem 3).

Step 3. Since the optimum information estimator corresponds to the maximum
likelihood distribution of an exponential family (Proposition 2), it is
natural to use BIC (Schwarz, 1978) to select the best estimate among
$A = 2, 4, \dots, \bar A$. This is the final optimum information estimator.

It is acceptable to simplify Step 2 by estimating only one density
(corresponding to $\bar A$) and omitting Step 3. (More discussion on BIC is given in
Section 3.)
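Step 3 can be illustrated with a short sketch. It assumes that the maximized value of the dual objective (2.2) has already been computed for each order $A$ (by whatever solver is used in Step 2), uses the fact from Proposition 2 that the maximized log likelihood equals $N$ times that value, and penalizes it by $\frac{1}{2}L\log N$ in the manner of Schwarz (1978). The names `num_moments` and `select_order` are hypothetical, and the moment count assumes that the zeroth-order moment is excluded.

```python
import math

def num_moments(K, A):
    """Number of multi-indices alpha in N^K with 1 <= |alpha| <= A.

    There are C(A + K, K) monomials of degree at most A in K variables;
    the constant monomial is excluded (assumption).
    """
    return math.comb(A + K, K) - 1

def select_order(dual_values, N, K=1):
    """Pick the order A with the largest BIC-type score.

    dual_values maps each candidate order A to the maximized value of the
    dual objective, i.e. the average log likelihood per observation.
    Returns (best_A, scores).
    """
    scores = {
        A: N * value - 0.5 * num_moments(K, A) * math.log(N)
        for A, value in dual_values.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

if __name__ == "__main__":
    # Hypothetical dual values for A = 2, 4, 6 with N = 100 observations.
    best, scores = select_order({2: -1.0, 4: -0.99, 6: -0.989}, N=100)
    print(best, scores)
```

With these hypothetical inputs the small gain in fit from higher orders does not offset the penalty, so the lowest order is selected; a larger gain (for example a dual value of -0.9 at $A = 4$) would tip the choice toward the richer family.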

Estimating conditional densities. Economists are usually interested in the
conditional density of a variable $Y$ (e.g., income) conditional on some other
variables $X$ (e.g., sex, age, education, experience, etc.) or the conditional
expectation $\mathrm{E}[Y|X]$ rather than the unconditional distribution. Estimation of
these quantities is straightforward. For instance, the conditional density of
$Y$ given $X$ is estimated by $\hat f(y|x) := \hat f(x,y)/\hat f(x)$, and the conditional expectation is
estimated by $\hat{\mathrm{E}}[Y|X] = \int y\hat f(y|x)\,dy$.
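The two formulas above can be sketched numerically for a scalar $Y$. The helper below is a hypothetical illustration: it takes any callable joint density estimate `joint(x, y)`, obtains the marginal $\hat f(x)$ by integrating over $y$ with the midpoint rule (an assumption made here for simplicity), and returns the conditional density together with the conditional expectation.

```python
def conditional_density(joint, x, y_lo, y_hi, n=2000):
    """Return (f_cond, mean) for a scalar y on [y_lo, y_hi].

    f_cond(y) = joint(x, y) / int joint(x, y) dy, and
    mean      = int y * f_cond(y) dy,
    with both integrals approximated by the midpoint rule on n cells.
    """
    h = (y_hi - y_lo) / n
    ys = [y_lo + (i + 0.5) * h for i in range(n)]
    vals = [joint(x, y) for y in ys]
    marginal = h * sum(vals)                      # estimate of f(x)
    mean = h * sum(y * v for y, v in zip(ys, vals)) / marginal
    return (lambda y: joint(x, y) / marginal), mean

if __name__ == "__main__":
    # Toy joint density f(x, y) = x + y on the unit square (integrates to one).
    f_cond, mean = conditional_density(lambda x, y: x + y, 0.5, 0.0, 1.0)
    print(f_cond(0.5), mean)
```

For the toy density $f(x,y) = x + y$ on the unit square, the exact conditional mean at $x = 0.5$ is $7/12$, which the sketch reproduces up to quadrature error.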

Asymptotic properties of the optimum information estimator. Since
the quasi-maximum likelihood estimator is a special case of an M-estimator
(van der Vaart, 1998, Chapter 5), as the sample size gets large, $\hat\lambda$ converges
in probability to the $\bar\lambda$ that solves the population counterpart of (2.2),^7 and so
does the Kullback-Leibler information:

$$\int f\log\frac{f}{\hat f} =: H(f;\hat f) \xrightarrow{p} H(f;f_{\bar\lambda}) := \int f\log\frac{f}{f_{\bar\lambda}}.$$

The quantity $2NH(f;\hat f)$ is asymptotically distributed as noncentral $\chi^2$ with $L$
degrees of freedom and noncentrality parameter $2NH(f;f_{\bar\lambda})$ (Kullback, 1959,
pp. 97-107). Now we show that the optimum information estimator $\hat f$
asymptotically approximates the true distribution $f$ arbitrarily well, but we need a
lemma first. To avoid unnecessary complication I assume that the true density
$f$ is positive everywhere on its support.

Lemma 4. Let $f$ be a positive, continuous density on the compact set $S \subset \mathbb{R}^K$.
Then there exists $A > 0$ such that the exponential family $f(x;\lambda) \propto e^{\lambda'T(x)}$,
where $T(x) = (x^\alpha)_{|\alpha| \le A}$, contains a density $f_0$ that arbitrarily approximates $f$
uniformly over $S$.

^7 This quantity is known as the pseudo-true value (Cameron and Trivedi, 2005, p. 146).
Proof. Since $f$ is positive and continuous on $S$, it has a positive minimum and
$\log f$ is continuous on $S$. By the Stone-Weierstrass theorem (Folland, 1999,
p. 139), we can take $\lambda \in \mathbb{R}^L$ and $C > 0$ such that $|\log f(x) - \lambda'T(x) - \log C|$ is
arbitrarily small on $S$. Then we can make $|f - Ce^{\lambda'T}|$ small. By normalizing
the density, we obtain $|f(x) - f_0(x)| < \epsilon$ for all $x \in S$ for some $f_0$ in the
exponential family. □



Theorem 5. Let $f$ be a positive, continuous density on the compact set $S \subset
\mathbb{R}^K$. For any $\epsilon > 0$, there exists a number $A > 0$ such that for any $\delta > 0$,
the optimum information estimator $\hat f$ corresponding to the moments $T(x) =
(x^\alpha)_{|\alpha| \le A}$ satisfies

$$\lim_{N \to \infty} \Pr\left( H(f;\hat f) > \epsilon + \delta \right) = 0.$$



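Proof. Let $\{f_\lambda\}$ be the exponential family in Lemma 4 and $\hat f$ be the optimum
information estimator of $f$ using the moments $|\alpha| \le A$. Take $f_0 \in \{f_\lambda\}$ that
uniformly approximates $f$. Since $H(f;\hat f) \xrightarrow{p} H(f;f_{\bar\lambda})$, where $f_{\bar\lambda}$ minimizes
the Kullback-Leibler information among the exponential family $\{f_\lambda\}$, we get
$H(f;f_{\bar\lambda}) \le H(f;f_0)$. Hence we only need to show that we can choose $A$ such
that $H(f;f_0) < \epsilon$. Since $f, f_0$ are positive on $S$, we get

$$\begin{aligned}
H(f;f_0) &= -\int f\log\frac{f_0}{f} = -\int f\log\left(1 + \frac{f_0 - f}{f}\right) \\
&= -\int f\left[\frac{f_0 - f}{f} - \frac{1}{2}\left(\frac{f_0 - f}{f}\right)^2 + o\bigl((f_0 - f)^2\bigr)\right] \\
&= \int\left[\frac{1}{2f}(f_0 - f)^2 + o\bigl((f_0 - f)^2\bigr)\right],
\end{aligned}$$

where the first-order term vanishes because $\int f_0 = \int f = 1$. The last expression
can be made arbitrarily small by Lemma 4. □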



3 Goodness of Fit and Model Selection 

This section is an application of the optimum information estimator to evaluate 
the goodness of fit of parametric models and select the best fitting model. I 
consider two cases separately, models for the unconditional density and the 
conditional density. 



Unconditional models. Suppose that we have $M$ competing models denoted
by $\mathcal{M} = \{1, 2, \dots, M\}$, where model $m$ has a parametric density $f_m(x;\theta_m)$
with parameter $\theta_m \in \Theta_m$. Given data, how should we choose between these
models, and how should we evaluate the goodness of fit?

Ideally, by the optimum information principle the goodness of fit should be
evaluated by the minimized Kullback-Leibler information,

$$H_{\min} = \min_{\theta \in \Theta} \int f(x)\log\frac{f(x)}{f(x;\theta)}\,dx, \tag{3.1}$$

where $\{f(x;\theta)\}_{\theta \in \Theta}$ is a particular parametric model (Toda, 2011, Theorem
5). However, this approach is infeasible because we do not know the true $f$.
As an approximation, in his seminal paper Akaike (1974) approximated the
quantity $\int f\log f(x;\bar\theta)$, which appears in the expansion of (3.1) and in which $\bar\theta$ denotes
the optimum parameter value, and derived his celebrated Akaike Information
Criterion (AIC).

We take a different approach. Since the optimum information principle
implies Bayesian inference (Toda, 2011, Theorem 4), which is an exact implication
as opposed to the approximate implication for maximum likelihood (Toda, 2011,
Theorem 5), the Bayesian approach to model selection by Schwarz (1978) can
be fully justified from an information-theoretic point of view. Now the
optimum information estimator $\hat f$ in (2.3), being a maximum likelihood distribution
of a particular exponential family (Theorem 3), is also a particular
parametric model. Therefore the goodness of fit and model selection of the competing
models $\mathcal{M} = \{1, 2, \dots, M\}$ reduces to the model selection of $\{0\} \cup \mathcal{M}$, where
model 0 corresponds to the exponential family in Proposition 2 that generates
the optimum information estimator (2.3). Model 0 serves as our benchmark
model.

An approximation of the evidence^8 (the logarithm of the Bayesian likelihood)
of model $m$ is given by (Schwarz, 1978, p. 461)^9

$$E_m := \log\mathcal{L}_m(\hat\theta_m) - \frac{1}{2}K_m\log N, \tag{3.2}$$

where $\log\mathcal{L}_m$ is the log likelihood of model $m$, $\hat\theta_m$ the maximum likelihood
estimator, $K_m$ the number of unknown parameters (the dimension of $\theta_m$), and
$N$ the sample size.^10 In particular, for the benchmark model 0, we have
$K_0 = \dim T(S) = L$. The larger the evidence $E$ is, the better the model
fits.

By Laplace's Principle of Indifference (Laplace, 1812) (which is a particular
axiom of plausible reasoning in Toda (2011)), let us assign prior probability
$\frac{1}{M+1}$ to models $m = 0, 1, \dots, M$. Then, by Schwarz's fundamental proposition
(Schwarz, 1978, p. 462) and Bayes's rule (which is implied by the optimum
information principle (Toda, 2011, Theorem 4)), the posterior probability of
model $m$ given data $D = \{x_n\}$ is approximately given by

$$P(m|D) = \frac{e^{E_m}}{\sum_{m'=0}^M e^{E_{m'}}}. \tag{3.3}$$

$P(m|D)$ is a measure of model fit. If $P(m|D)$ is large for some $m = 1, 2, \dots, M$,
then it is a good model. If $P(0|D)$ is large, then all models are poor. The
fundamental difference between our evidence $E$ and posterior model
probability $P(m|D)$ on the one hand, and other information criteria such as AIC and BIC
on the other, is that while AIC and BIC are relative measures of model fit, evidence $E$
and $P(m|D)$ are absolute measures. By using AIC and BIC we can select the best model among
the competing models, but it might be the case that all models are poor. On
the other hand, our approach tells us each model's absolute fit.
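The posterior model probabilities are a softmax of the evidences under the equal prior, which can be sketched in a few lines. The function name `posterior_model_probabilities` is hypothetical. Because evidences are log likelihoods and are typically large negative numbers, the sketch subtracts the largest evidence before exponentiating (the usual log-sum-exp device, an implementation choice made here and not taken from the text).

```python
import math

def posterior_model_probabilities(evidences):
    """Map evidences [E_0, E_1, ..., E_M] to posterior probabilities.

    Assumes the equal prior 1/(M+1) on all models, so that
    P(m|D) = exp(E_m) / sum_k exp(E_k). Index 0 is the benchmark model.
    """
    top = max(evidences)                          # avoid underflow in exp
    weights = [math.exp(e - top) for e in evidences]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    # Hypothetical evidences: benchmark model 0 and two parametric models.
    probs = posterior_model_probabilities([-1000.0, -1001.0, -1010.0])
    print(probs)
```

In this hypothetical example most of the posterior mass falls on the benchmark model 0, which would indicate that neither parametric model fits well in absolute terms.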



^8 This term is due to Jaynes (1956).

^9 Strictly speaking, Schwarz (1978) proved (3.2) only for the exponential family, but since
any continuous, compactly supported density can be arbitrarily approximated by an
exponential family (Lemma 4), (3.2) is true for any such density.

^10 I was tempted to call the quantity in (3.2) TIC, for the obvious reason, but this acronym
is reserved for the Takeuchi Information Criterion (Takeuchi, 1976). Besides, the evidence is
$-1/2$ of BIC and hence there is no reason to introduce a new name.
Conditional models. Suppose now that the models are conditional, meaning
that they only describe a conditional density $f(y|x;\theta)$, which is more important
in economics. The optimum information principle gives us a measure of goodness
of fit and model selection in this case, too. The evidence $E_m$ in (3.2) is changed
to

$$E_m := \log\mathcal{L}_m(\hat\theta_m; Y|X) + \log\mathcal{L}(X) - \frac{1}{2}K_m\log N, \tag{3.4}$$

where $\log\mathcal{L}_m(\hat\theta_m; Y|X)$ denotes the maximized conditional log likelihood of
model $m$ and $\log\mathcal{L}(X)$ denotes the log likelihood of data $X = \{x_n\}$. Since
the true density of $\{X_n\}$ is unknown, $\log\mathcal{L}(X)$ is a nuisance parameter. But
since it is common across all models, if we substitute $E_m$ in (3.4) into the
fundamental formula (3.3), $\exp(\log\mathcal{L}(X)) = \mathcal{L}(X)$ in the numerator and denominator
cancel out (!). Therefore the posterior model probability $P(m|D)$ can just be
computed by using the maximized conditional log likelihood. Let us summarize
this in a theorem:

Theorem 6 (Goodness of fit and model selection). Let $\mathcal{M} = \{0, 1, 2, \dots, M\}$
consist of $M$ competing models with parametric conditional density $f_m(y|x;\theta_m)$ and
model 0, the benchmark model corresponding to the optimum information
estimator $\hat f(y|x) := \hat f(x,y)/\hat f(x)$. Let $\hat\theta_m$ be the maximum likelihood estimator of model
$m$ and define the conditional evidence of model $m > 0$ by

$$E_m := \log\mathcal{L}_m(\hat\theta_m; Y|X) - \frac{1}{2}K_m\log N,$$

where $K_m = \dim\theta_m$ and $N$ is the sample size, and

$$E_0 := \log\mathcal{L}(\hat\theta_{X,Y}; X, Y) - \log\mathcal{L}(\hat\theta_X; X) - \frac{1}{2}(K_{X,Y} - K_X)\log N,$$

where $\log\mathcal{L}(\hat\theta_Z; Z)$ denotes the maximized log likelihood of the exponential family
corresponding to the optimum information estimator for the density $f(z)$ ($z = x$
or $z = (x, y)$), and $K_Z$ is its parameter dimension. Then, the posterior
probability of model $m$ is

$$P(m|D) = \frac{e^{E_m}}{\sum_{m'=0}^M e^{E_{m'}}}.$$
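The two evidence formulas in Theorem 6 are simple enough to sketch directly. The function names below are hypothetical, and the inputs (maximized log likelihoods and parameter counts) are assumed to come from separate estimation steps that are not shown.

```python
import math

def conditional_evidence(loglik_cond, K_m, N):
    """E_m for a parametric conditional model m > 0.

    loglik_cond is the maximized conditional log likelihood of Y given X.
    """
    return loglik_cond - 0.5 * K_m * math.log(N)

def benchmark_evidence(loglik_xy, loglik_x, K_xy, K_x, N):
    """E_0 for the benchmark model.

    loglik_xy and loglik_x are the maximized log likelihoods of the
    exponential families fitted to (x, y) jointly and to x alone;
    K_xy and K_x are their parameter dimensions.
    """
    return loglik_xy - loglik_x - 0.5 * (K_xy - K_x) * math.log(N)

if __name__ == "__main__":
    # Hypothetical inputs with N = 100 observations.
    E0 = benchmark_evidence(-250.0, -140.0, K_xy=5, K_x=2, N=100)
    E1 = conditional_evidence(-120.0, K_m=3, N=100)
    print(E0, E1)
```

The resulting evidences can then be converted into posterior model probabilities with the formula at the end of Theorem 6.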

Our information-theoretic approach to the goodness of fit of conditional
distributions is much simpler than frequentist approaches such as Andrews (1997),
Fan (1998), and Delgado and Stute (2008) since it requires no bootstrapping,
kernel density estimation, or complicated integration. All we need is
maximum likelihood estimation. Another advantage is that the frequentist approach
can only test a null hypothesis, thereby accepting or rejecting a model, but our
approach can evaluate an arbitrary number of models simultaneously. A method
related to our approach of model comparison is the likelihood ratio test (see
Wilks (1938) for nested models and Vuong (1989) for non-nested models), but it
only applies to two models and it provides no information on the goodness of fit.
The information-theoretic approach is applicable to any number of non-nested
models and gives an absolute measure of goodness of fit.
AIC or BIC? I used BIC (Schwarz, 1978) to define the evidence of a model
in (3.2). However, there are other information criteria such as AIC, AICc, BIC,
CIC, DIC, EIC, QIC, TIC, etc. (see Burnham and Anderson (2004) and
Anderson (2008)), among which AIC (AICc) and BIC are by far the most applied.
Anderson (2008, p. 160) dismisses BIC as having "nothing linking
it to information theory", which is incorrect, but Anderson's book was written
before my discovery (Toda, 2011). I believe that BIC is the most
fundamental concept because Bayesian inference is an exact implication of the optimum
information principle and Schwarz (1978) uses only one approximation to
derive BIC (approximating log likelihood), whereas the approach of Akaike (1974)
is conceptually closer to the optimum information principle but invokes two
approximations (approximating the Kullback-Leibler information and log
likelihood). But let us not be dogmatic: it is equally acceptable to use AIC or AICc
(Sugiura, 1978), which are also derived by information theory.

4 Performance of OIE with Real Data 

To the best of my knowledge, the only applications of maximum entropy
estimation in economics and finance are Wu (2003) and the references therein.
The estimated maximum entropy density superimposed on the histogram of
1999 U.S. family income in Figure 1 of Wu (2003) shows a good fit, as
predicted by the theory (Theorem 5). Maximum entropy estimation
(information-theoretic estimation) is much more popular outside economics, in particular
in physics. Jaynes (2003, p. 125) mentions that the Bayesian analysis
(synonymous with information-theoretic analysis) of nuclear magnetic resonance
(NMR)^11 data obtained by his student (Bretthorst, 1988) showed an
orders-of-magnitude (i.e., at least 10-fold) improvement of resolution over the Fourier
transform method, which was conventional at the time, and because of this surprising
improvement Bretthorst's result was not believed for a long time.

The value of the information-theoretic nonparametric estimation (and model 
selection) method proposed in this paper compared to conventional methods 
such as kernel density estimation should ultimately be judged by their relative 
performances in analyzing real data. However, there are a few reasons to believe 
that the optimum information estimator is superior: 

1. The optimum information principle fully exploits the available informa- 
tion as opposed to other ad hoc methods. For instance, kernel density 
estimation is essentially a local linear regression and hence uses only local 
information. 

2. Kernel density estimation has a lot of arbitrariness with regard to the 
choice of the kernel and the bandwidth, whereas the only arbitrariness in 
the information-theoretic density estimation is the number of moments to 
include as constraints. Even this arbitrariness can be removed by selecting 
the optimal number of constraints by BIC. 

3. Since the optimum information estimation reduces to the maximum
likelihood estimation of an exponential family (in the present case), it is fully
parametric, computationally straightforward, and free from the curse of
dimensionality.

^11 NMR is applied in medicine for making 2D and 3D images of the inside of the human body
for diagnostic purposes, which is known as magnetic resonance imaging (MRI). ("Nuclear" is
dropped because it is not a politically correct word.)



5 Concluding Remarks 

Statistics and econometrics are sciences of extracting information from data. 
Hence an inference method is valuable if and only if it is useful in analyzing 
real data, and therefore an inference method requires no interpretation, and 
no justification except practical usefulness: we should refrain from being too 
dogmatic as exemplified by the heated frequentist/Bayesian debate in the past. 
Comparisons of the performance of the optimum information estimator and
other methods using real data are welcome, although they are beyond the scope of
the present paper.

References 

Akaike, H. (1974): "A New Look at the Statistical Model Identification," 
IEEE Transactions on Automatic Control, AC-19, 716-723. 

Anderson, D. R. (2008): Model Based Inference in the Life Sciences: A 
Primer on Evidence, NY: Springer. 

Andrews, D. W. K. (1997): "A Conditional Kolmogorov Test," Econometrica, 
65, 1097-1128. 

Bayes, R. T. (1763): "An Essay toward Solving a Problem in the Doctrine of 
Chances," Royal Society Philosophical Transactions, 53, 370-418. 

Boţ, R. I. and G. Wanka (2006): "A Weaker Regularity Condition for
Subdifferential Calculus and Fenchel Duality in Infinite Dimensional Spaces,"
Nonlinear Analysis, 64, 2787-2804.

Borwein, J. M. and A. S. Lewis (1991): "Duality Relationships for
Entropy-like Minimization Problems," SIAM Journal on Control and Optimization, 29,
325-338.

Borwein, J. M. and A. S. Lewis (2006): Convex Analysis and Nonlinear Optimization: Theory and
Examples, Canadian Mathematical Society Books in Mathematics, New York:
Springer, second ed.

Box, G. E. P. (1976): "Science and Statistics," Journal of the American Sta- 
tistical Association, 71, 791-799. 

Bretthorst, G. L. (1988): Bayesian Spectrum Analysis and Parameter Esti- 
mation, vol. 48 of Lecture Notes in Statistics, Berlin: Springer. 

Burnham, K. P. and D. R. Anderson (2004): "Multimodel Inference:
Understanding AIC and BIC in Model Selection," Sociological Methods and
Research, 33, 261-304.

Cameron, A. C. and P. K. Trivedi (2005): Microeconometrics: Methods 
and Applications, New York: Cambridge University Press. 






Delgado, M. A. and W. Stute (2008): "Distribution-free Specification Tests
of Conditional Models," Journal of Econometrics, 143, 37-55.

Fan, Y. (1998): "Goodness-of-Fit Tests Based on Kernel Density Estimators
with Fixed Smoothing Parameters," Econometric Theory, 14, 604-621.

Fisher, R. A. (1912): "On An Absolute Criterion for Fitting Frequency 
Curves," Messenger of Mathematics, 41, 155-160. 

Folland, G. B. (1999): Real Analysis: Modern Techniques and Their
Applications, Hoboken, NJ: John Wiley & Sons, second ed.

Gibbs, J. W. (1902): Elementary Principles in Statistical Mechanics Developed
with Especial Reference to the Rational Foundation of Thermodynamics, New
York: Charles Scribner's Sons.

Huber, P. J. (1967): "The Behavior of Maximum Likelihood Estimates under
Nonstandard Conditions," in Proceedings of the Fifth Berkeley Symposium on
Mathematical Statistics and Probability, ed. by J. Neyman, vol. 1, 221-233.

Jaynes, E. T. (1956): "Probability Theory in Science and Engineering," in
Colloquium Lectures in Pure and Applied Science, USA: Socony-Mobile Oil
Co., 4.

Jaynes, E. T. (1957): "Information Theory and Statistical Mechanics, I,"
Physical Review, 106, 620-630.

Jaynes, E. T. (2003): Probability Theory: The Logic of Science, Cambridge, U.K.:
Cambridge University Press, edited by G. Larry Bretthorst.

Kullback, S. (1959): Information Theory and Statistics, New York: John
Wiley & Sons.

Kullback, S. and R. A. Leibler (1951): "On Information and Sufficiency,"
Annals of Mathematical Statistics, 22, 79-86.

Laplace, P. S. (1812): Théorie Analytique des Probabilités, Paris: Courcier.

Luenberger, D. G. (1969): Optimization by Vector Space Methods, New York:
John Wiley & Sons.

Rockafellar, R. T. (1970): Convex Analysis, Princeton, NJ: Princeton
University Press.

Schwarz, G. (1978): "Estimating the Dimension of a Model," Annals of
Statistics, 6, 461-464.

Shannon, C. E. (1948): "A Mathematical Theory of Communication," Bell
System Technical Journal, 27, 379-423, 623-656.

Sugiura, N. (1978): "Further Analysis of the Data by Akaike's Information
Criterion and the Finite Corrections," Communications in Statistics - Theory
and Methods, A7, 23-26.






Takeuchi, K. (1976): "Distribution of Informational Statistics and a Crite- 
rion of Model Fitting," Suri-Kagaku (Mathematical Sciences), 153, 12-18, 
(In Japanese). 

Toda, A. A. (2010): "Existence of a Statistical Equilibrium for an Economy
with Endogenous Offer Sets," Economic Theory, 45, 379-415.

Toda, A. A. (2011): "Unification of Maximum Entropy and Bayesian Inference via
Plausible Reasoning," Submitted to IEEE Transactions on Information
Theory, http://arxiv.org/abs/1103.2411.

van der Vaart, A. W. (1998): Asymptotic Statistics, Cambridge University
Press.

Vuong, Q. H. (1989): "Likelihood Ratio Tests for Model Selection and
Non-Nested Hypotheses," Econometrica, 57, 307-333.

Wilks, S. S. (1938): "The Large-Sample Distribution of the Likelihood Ratio
for Testing Composite Hypotheses," Annals of Mathematical Statistics, 9,
60-62.

Wu, X. (2003): "Calculation of Maximum Entropy Densities with Application 
to Income Distribution," Journal of Econometrics, 115, 347-354. 






